The objective of this work, based on the M5 forecasting competition organised on GitHub, is to make the most accurate possible predictions of sales over 28 days in a Walmart store. By analyzing sales from 2011 to 2016, we aim to find the model that makes the most accurate predictions possible, based on the knowledge acquired in the forecasting course. To achieve this goal, we will first familiarize ourselves with the data in the exploratory data analysis part, trying to understand the different patterns in the data and to identify possible trends and seasonalities. As the data is hierarchical (we focus on the TX1 store, which is divided into 3 categories, each further divided into 2 or 3 departments), we will test several approaches to see what gives us the best results. First, we will make a hierarchical decomposition of the data (going at most to the 2nd level, the department level), taking into account the other stores in the country, using different reconciliation methods (bottom-up, top-down, middle-out, etc.) as well as different forecasting models (ARIMA, ETS). Second, we will try another approach by analyzing only the store we are interested in, without taking the hierarchy into account.
Once all these possibilities have been tested, we will select the best model according to certain criteria explained below, and we will make the predictions for the 28 days.
To evaluate our different models, we divided our data into 3 parts. Historical data is provided from 2011-01-29 to 2016-06-19, but the last part of this period contains only NAs. We therefore kept all the days without NAs, from which we removed 28 days for a validation set and another 28 days for a test set. The figure below summarizes our split.
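Our split can be sketched as follows; the variable names and the simulated series are placeholders, and 1913 is the number of usable days (detailed later in the report):

```r
# Hold out 2 x 28 days: a validation set and a test set.
library(forecast)

sales <- ts(rnorm(1913, mean = 3000, sd = 300), frequency = 7)  # placeholder

n     <- length(sales)
train <- subset(sales, end = n - 56)                   # days 1 .. 1857
valid <- subset(sales, start = n - 55, end = n - 28)   # 28-day validation set
test  <- subset(sales, start = n - 27)                 # 28-day test set
```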
We can see that CA represents more than 40% of sales, while WI and TX are about the same.
We can see that the two stores with the most sales are in CA, which is not surprising given the previous plot. Our store, TX_1, is in the lower-middle range in terms of sales.
Now, let’s focus on our store, TX_1.
We can see that food represents the vast majority of sales with more than 60%, followed by household. Hobbies represent less than 10% of sales.
Not surprisingly, it is one of the FOODS departments that accounts for the vast majority of sales. HOUSEHOLD_1 accounts for most of the sales of the HOUSEHOLD category.
We can see that the store makes more sales during the weekend.
All years appear similar in terms of average sales, although there are variations between them. Some patterns are common to all years, such as lower median sales in January, a peak around August and a drop between October and November. Note that not all products were on sale from January 2011, which is probably why 2011 is lower than the other years.
This plot shows whether there seems to be a trend in the prices charged. Average prices are lower at the beginning, but this is probably because not all products were on sale in 2011.
In this case there doesn’t seem to be any particular trend.
We selected the product "FOODS_3_794" in particular to show the possible price variations over time. The three downward spikes are probably discounts or errors in the data.
There is a majority of national and religious events. It would be interesting to know which events occur in particular during the 28 days that we have to predict.
The only events that took place during this 28-day period in previous years are: “NBA Finals”, “Independence day”, “Ramadan start” and “Father’s day”. Looking at the date of these events in 2016, only “Independence day” will take place within this 28-day window.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2925.6242 | 68.44472 | 42.744339 | 0.000000 |
| event_name_1 == “IndependenceDay”TRUE | 112.1758 | 379.85276 | 0.295314 | 0.768157 |
If we run a linear regression of sales on an Independence Day dummy, we can see that the event does not have a significant impact on sales (p-value ≈ 0.77).
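The regression behind the table above can be sketched like this; the sales series is simulated and the event vector is a hypothetical stand-in for the calendar data, but the model form is the same:

```r
# Regress sales on an Independence Day indicator (placeholder data).
set.seed(1)
sales  <- rnorm(365, mean = 2925, sd = 500)
events <- rep("", 365); events[186] <- "IndependenceDay"  # hypothetical date index

fit <- lm(sales ~ (events == "IndependenceDay"))
summary(fit)$coefficients  # an insignificant dummy -> no measurable event effect
```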
Looking at sales by day, there seems to be an annual seasonality and a growing trend. Moreover, one day a year the shop is closed, for Christmas.
If we decompose the time series by category, the seasonality we observed seems to come from FOODS, while the trend is driven more by HOBBIES and HOUSEHOLD.
The beginning of the household time series is marked by a slope much higher than the rest. This is probably due to the introduction of new products which creates a bias in the interpretation of the trend.
The trend is clear; however, the seasonality is not yet fully decomposed. As we have daily data, we probably have several seasonalities. In what follows, we try to decompose it manually.
This correlogram covers 3 years. There are 3 waves, with a peak approximately every 320 days, which suggests a possible annual seasonality.
This correlogram covers six months. Six waves can be clearly distinguished where the ACF drops below 0, which indicates a monthly seasonality.
This correlogram covers 30 days. There is a peak every 7 days, showing a weekly seasonality.
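The three correlograms come from the same series inspected at three horizons; a sketch on simulated data (the real analysis used the TX_1 daily sales):

```r
# Same series, three lag windows: annual, monthly and weekly views.
library(forecast)

set.seed(5)
daily_sales <- ts(rnorm(1100, 3000, 300), frequency = 7)  # placeholder

a_year  <- Acf(daily_sales, lag.max = 3 * 365)  # ~3 years: annual waves
a_month <- Acf(daily_sales, lag.max = 6 * 30)   # ~6 months: monthly waves
a_week  <- Acf(daily_sales, lag.max = 30)       # 30 days: spikes every 7 lags
```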
This EDA helped us familiarize ourselves with the data in general and with our TX_1 store in particular. As far as sales are concerned, there is a growing trend, although it is difficult to analyse because the number of products in the stores is not constant.
The weekly and the monthly seasonality are clear. The problem with a monthly seasonality is its lack of consistency, as not all months have the same number of days. Using a frequency of 30 or 31 would create a bias. Thus, we will not take it into account in our models.
Concerning the annual seasonality, it seems implausible that every day of the year is correlated with the same day of the previous year.
However, as shown by the closing of stores at Christmas, we believe most of the correlation between years comes from special events that are repeated.
Therefore, we will include in our models Christmas, which has a big impact on sales (the stores are closed), and Independence Day, the only event that takes place during the 28 days that we predict.
In this part we will use a frequency of 7, because there is a weekly seasonality and because the models run faster this way.
To avoid any problem related to the frequency used when creating the time series, we decided to use brackets.
The data at our disposal contains 1969 days (rows); however, from row 1914 onward, the data consists only of NAs.
We decided to remove 56 days (2 × 28) from the training set. This allows us to measure the predictions on a validation set of 28 days and still keep a test set.
According to the paper "Another look at measures of forecast accuracy" (Hyndman & Koehler, 2006), using the MASE gives us a metric that is easily comparable across our methods.
Moreover, since it is the mean absolute error of the forecasts divided by the in-sample mean absolute error of one-step naive forecasts, it doubles as a benchmark.
Thus, a MASE higher than one means that our model performs worse than in-sample one-step forecasts from the naive method, and should not be considered.
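A minimal MASE computation following that definition (the helper name is ours, not from our project code):

```r
# MASE = mean(|forecast error|) / in-sample MAE of one-step naive forecasts.
mase <- function(actual, forecast, train) {
  scale <- mean(abs(diff(train)))  # naive benchmark error on the training set
  mean(abs(actual - forecast)) / scale
}

mase(actual = c(5, 5), forecast = c(2, 2), train = c(1, 2, 4))  # 3 / 1.5 = 2
```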
This metric is easy to obtain when we use the forecast function; however, in the second part we will use auto.arima, which does not automatically return the MASE.
Thus, this second part will be analyzed separately using the RMSE.
Note that we could have used the RMSE everywhere, but the way we obtained the metrics from auto.arima is particular: we differenced the data before fitting the ARIMA.
To obtain a metric easily, we preferred to apply the same differencing to the validation set instead of the traditional approach of back-transforming our predictions.
Since we had to analyze the TX1 store, we started by trying to model its sales using all the data.
We were soon confronted with the limited computing power of our laptops.
Moreover, we were not sure that the code and method used were the right ones. To better understand how the hts and forecast functions work, we decided to conduct an analysis only on HOBBIES_2, the department with the fewest items in the store.
We only managed to produce the following prediction: a bottom-up ARIMA model.
Top level: HOBBIES_2

|  | HOBBIES_2 | HOB2010TX1 | HOB2016TX1 |
|---|---|---|---|
| ME | -7.353 | 0.027 | 0.035 |
| RMSE | 12.955 | 0.427 | 0.189 |
| MAE | 11.359 | 0.354 | 0.036 |
| MAPE | 50.439 | Inf | Inf |
| MPE | -42.279 | -Inf | -Inf |
| MASE | 1.264 | 3.689 | 0.122 |
The model is not good: the MASE of HOBBIES_2 is higher than 1. However, using an ARIMA for some articles could be a good option, as we can see with the article HOB2016TX1; for others, the model does not work at all. Moreover, the problem remains the same: we cannot compute more than 3,000 different models, one for each item sold at Walmart. With more time, we could have treated each item separately to find the best possible model. We would probably have chosen an ARIMA for HOB2016TX1, while looking for another model for HOB2010TX1, before aggregating them all.
In order to analyze all the sales of the store, an analysis stopping at the level of each department is more reasonable.
Thus we have modified our code to aggregate the sales of each item at the level of their respective department.
We computed 5 different models: ARIMA and ETS (Exponential Smoothing State Space Model), each with both a bottom-up and a top-down approach, plus a MinT (shrinkage) reconciliation. To the bottom-up ARIMA model we added regressors corresponding to Christmas and Independence Day.
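The hierarchical setup can be sketched as follows; the data is simulated and the node structure stands in for TX1's 3 categories and 7 departments, but the hts calls mirror the methods compared below (bottom-up, a top-down variant, and MinT with shrinkage):

```r
# Sketch of the hierarchical forecasts (placeholder data).
library(hts)

set.seed(42)
bts <- ts(matrix(rpois(120 * 7, lambda = 20), ncol = 7), frequency = 7)
hierarchy <- hts(bts, nodes = list(3, c(3, 2, 2)))  # 3 categories -> 3+2+2 depts

fc_bu   <- forecast(hierarchy, h = 28, method = "bu",   fmethod = "arima")
fc_td   <- forecast(hierarchy, h = 28, method = "tdfp", fmethod = "arima")
fc_mint <- forecast(hierarchy, h = 28, method = "comb",
                    weights = "mint", covariance = "shr", fmethod = "arima")
```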
Top level: TX1

|  | ARIMA (bottom-up) | ETS (bottom-up) | ARIMA (top-down) | ETS (top-down) | MinT-shr |
|---|---|---|---|---|---|
| ME | 636.470 | 417.965 | 590.267 | 438.109 | 410.373 |
| RMSE | 1258.891 | 1508.708 | 1467.186 | 1514.347 | 1502.121 |
| MAE | 961.835 | 1163.264 | 1121.520 | 1160.847 | 1158.583 |
| MAPE | 10.222 | 12.555 | 12.010 | 12.495 | 12.513 |
| MPE | 5.729 | 2.351 | 4.680 | 2.590 | 2.270 |
| MASE | 0.803 | 0.972 | 0.937 | 0.970 | 0.968 |
The ARIMA model using a bottom-up approach gives the best results.
Using the top-down method earlier, we realized that by limiting the top level to TX1 we were losing information that could be taken into account. Furthermore, using all the stores in Texas as well as all the stores in the US would not add that many extra models for the computer to fit.
Thus we have re-tried the models used previously.
Top level: all stores

|  | ARIMA (bottom-up) | ETS (bottom-up) | ARIMA (top-down) | ETS (top-down) | MinT-shr |
|---|---|---|---|---|---|
| ME | 214.313 | 77.068 | 187.703 | 1265.745 | -680.343 |
| RMSE | 920.166 | 1072.454 | 959.829 | 1681.506 | 1273.730 |
| MAE | 748.278 | 920.409 | 783.848 | 1318.651 | 1102.523 |
| MAPE | 8.197 | 10.200 | 8.632 | 13.571 | 13.059 |
| MPE | 1.360 | -0.529 | 1.027 | 12.853 | -9.082 |
| MASE | 0.721 | 0.887 | 0.756 | 1.271 | 1.063 |
These models give better results.
We had omitted an option that had seemed to make no sense until then: middle-out. Indeed, combining a bottom-up method going from departments up to the TX1 store with a top-down method going from all US stores down to the TX1 store seemed promising.
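In hts, middle-out is requested with method = "mo" and a `level` argument marking where bottom-up meets top-down; a sketch with simulated data and an assumed two-level structure:

```r
# Middle-out sketch: bottom-up below `level`, top-down above it.
library(hts)

set.seed(1)
bts <- ts(matrix(rpois(120 * 6, lambda = 15), ncol = 6), frequency = 7)
h2  <- hts(bts, nodes = list(3, c(2, 2, 2)))   # e.g. 3 states -> 6 stores

fc_mo <- forecast(h2, h = 28, method = "mo", level = 1, fmethod = "arima")
```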
Top level: all stores

|  | ARIMA | ETS |
|---|---|---|
| ME | 295.082 | 156.670 |
| RMSE | 914.796 | 1081.349 |
| MAE | 747.125 | 917.643 |
| MAPE | 8.166 | 10.078 |
| MPE | 2.407 | 0.369 |
| MASE | 0.720 | 0.885 |
Indeed, this was the method that gave us the lowest MASE.
The following graph shows the predicted values in red and the actual values in black. Unfortunately, since we used the middle-out method, we were not able to add a prediction interval with the code provided.
x <- FORECAST_ALL_mo_arima %>% aggts(levels = 2)
y <- HTS_te_lvl_0_to_4 %>% aggts(levels = 2)
# Convert the ts columns to plain numeric so ggplot can pick its scales
df <- data.frame(prediction = as.numeric(x[, 5]),
                 actual     = as.numeric(y[, 5]),
                 time       = 1:28)
ggplot(df, aes(x = time)) +
  geom_line(aes(y = prediction), color = "red") +
  geom_line(aes(y = actual)) +
  ylab("Sales") + xlab("Time") +
  ggtitle("Observations and Predictions")
In the Exploratory Data Analysis, we saw that our data had several seasonalities. The weekly seasonality was particularly marked, so we decided to create a time series with a frequency of 7 days.
In the next section, we will try to fit an ARIMA model in different ways on the top level (TX1).
To start, we simply apply the auto.arima function without any prior transformation. The approximation and stepwise parameters are set to FALSE.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,1,0)
## Q* = 437.2, df = 5, p-value < 2.2e-16
##
## Model df: 5. Total lags used: 10
The result is an ARIMA(5,1,0). The residuals are normally distributed and centered around zero, but we can see that some dependency remains (ACF).
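A sketch of the call that produces this kind of fit and its residual diagnostics; the series here is simulated (on the real TX1 series this yielded the ARIMA(5,1,0) above), and max.order is capped only to keep the sketch fast:

```r
# Exhaustive search instead of the default stepwise/approximated one.
library(forecast)

set.seed(7)
y <- ts(cumsum(rnorm(150, 0, 30)) + 3000, frequency = 7)  # placeholder series

fit <- auto.arima(y, stepwise = FALSE, approximation = FALSE, max.order = 4)
checkresiduals(fit)  # Ljung-Box test and residual ACF, as reported above
```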
We then forecast the next 28 days and compute the accuracy on the validation set (which covers the same period).
|  | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 154.9081 | 499.8964 | 388.8319 | 2.665502 | 11.03722 | 0.283865 | 0.8823459 |
We saw in the previous test that our model did not capture all the information contained in the data. We will therefore carry out two new tests, performing the transformations ourselves before applying the auto.arima function.
We start by applying a differencing with lag 7.
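A lag-7 difference, y'_t = y_t − y_{t−7}, removes the weekly pattern; a minimal sketch on simulated data:

```r
# Weekly pattern: higher sales on the last two days of each "week".
y  <- ts(rep(c(5, 5, 5, 5, 5, 9, 9), 20) + rnorm(140, 0, 0.1), frequency = 7)
y7 <- diff(y, lag = 7)  # the first 7 observations are lost

# The differenced series is centered around zero; the weekly jumps are gone.
```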
We now have a time series centered around zero. We can also see that a seasonality is still visible on the time plot, and more particularly on the ACF, which points to a monthly seasonality (lag = 30).
We now apply auto.arima again and see whether we get something better in terms of residuals than in our first trial.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,5) with zero mean
## Q* = 321.44, df = 5, p-value < 2.2e-16
##
## Model df: 5. Total lags used: 10
This time we get an ARIMA(0,0,5), which shows that no additional differencing was applied. The residuals are still not satisfactory, but the pattern is less clear than before. The Ljung-Box test confirms that the residuals of our model do not come from white noise.
We’ll see with the metrics if our hunch is confirmed.
|  | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 21.85714 | 417.5612 | 350.7143 | 100 | 100 | 0.4590362 | 0.9624683 |
Looking at the accuracy metrics, the RMSE goes from 499.89 to 417.56, so we have an improvement. However, it is important to note that we did not apply the inverse transformation after forecasting; we simply applied the same differencing to our validation set in order to obtain the accuracy (we also lose 7 days of observed data on the graph). This is why the predictions look very flat on the graph. Normally, an inverse transformation must be applied to obtain the final predictions.
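For completeness, the inverse transformation we skipped can be written as a small helper (a hypothetical function, not part of our project code): each forecast on the differenced scale is added back to the value 7 days earlier.

```r
# Undo a lag-7 difference iteratively: y_hat[n+t] = d_hat[t] + y[n+t-7].
invert_diff7 <- function(diff_fc, last7) {
  # diff_fc: forecasts on the differenced scale
  # last7:   the last 7 observed values of the original series
  out  <- numeric(length(diff_fc))
  hist <- last7
  for (t in seq_along(diff_fc)) {
    out[t] <- diff_fc[t] + hist[1]
    hist   <- c(hist[-1], out[t])
  }
  out
}

invert_diff7(rep(7, 5), 14:20)  # a series growing by 1/day: returns 21:25
```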
For this next test, we take our initial time series, apply a differencing with a lag of 30 and then fit an ARIMA.
This time, it is the weekly seasonality that appears on the ACF, which is not really a surprise.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,0,0) with non-zero mean
## Q* = 806.84, df = 4, p-value < 2.2e-16
##
## Model df: 6. Total lags used: 10
The auto.arima function this time gives an ARIMA(5,0,0). Again, no further differencing was applied. As in the previous tests, the Ljung-Box test rejects H0 with a p-value below 0.05, which is also visible on the ACF.
|  | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 27.02445 | 437.1928 | 346.4124 | 47.35097 | 103.1009 | 0.5029452 | 1.0649 |
The RMSE is worse than in our previous test with a value of 437.19.
We saw on several previous graphs that the store closes at Christmas. Applying a 365-day differencing would affect every point of our time series when we are only interested in that particular day. One solution is to pass a regressor to the auto.arima function to take this information into account, which is what we do in this section. We also apply a 7-day differencing, since we found that this gives a better result.
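A sketch of combining the lag-7 differencing with a Christmas dummy passed through xreg; the series, the dummy's position and the alignment choice are placeholders:

```r
# Christmas enters as an external regressor on the differenced series.
library(forecast)

set.seed(3)
y    <- ts(rnorm(364, 3000, 300), frequency = 7)
xmas <- rep(0, 364); xmas[330] <- 1   # hypothetical Christmas position

y7 <- diff(y, lag = 7)
x7 <- xmas[-(1:7)]                    # align the regressor with the diff

fit <- auto.arima(y7, xreg = x7)
```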
|  | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|
| Test set | 23.81543 | 417.3098 | 350.5692 | 101.6098 | 101.6098 | 0.4656876 | 0.9444874 |
By adding the regressor, the model improves only slightly compared to our previous best prediction: we obtain an RMSE of 417.31 instead of 417.56.
As a conclusion to this report, we have identified two models that seem conclusive to us: an auto.arima with a differencing of 7 and event regressors, as well as …. The two models are difficult to compare, because the auto.arima with differencing computes its accuracy on a validation set that has itself been differenced. Thus, the MASE, which is our index for comparing the performance of our models, is not available for that model.
How could we have gone further in this project? First, we could have included more parameters in the regressors of our model, such as all the events of the year. We only included Christmas, which has a considerable effect since it is the only day of the year when the store is closed and sales are therefore zero, as well as Independence Day, the only event that falls within the 28 days we have to forecast. Then, we could have taken the monthly seasonality into account in all the models; but as mentioned before, since the number of days varies from month to month, we could not simply apply a 28- or 30-day differencing. Finally, another way to improve our predictions would have been to go down to a lower layer of the data, i.e. to make predictions for each particular product, rather than stopping at the second level (dept_id).